I would like to predict if high scoring determines if you are on a good team or not. A lot of times there are players who score a lot but aren’t on a good team. For many people including fans and scouts it will determine a lot about the player if he is not a good team or not. If that player is on a good team chances are that player has a great upside and potential. I will determine if high scoring means your are on a good team or not. I will do this through my second data source that stores the top 25 ranked teams in the nation and will use data from my first deliverable.

include <- function(library_name){
  if( !(library_name %in% installed.packages()) )
    install.packages(library_name) 
  library(library_name, character.only=TRUE)
}
include("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
include("knitr")
include("caret")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
include("rvest")
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
include("dplyr")


purl("deliverable1.Rmd", output="part1.r")
## 
## 
## processing file: deliverable1.Rmd
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |.........                                                        |  14%
  |                                                                       
  |...................                                              |  29%
  |                                                                       
  |............................                                     |  43%
  |                                                                       
  |.....................................                            |  57%
  |                                                                       
  |..............................................                   |  71%
  |                                                                       
  |........................................................         |  86%
  |                                                                       
  |.................................................................| 100%
## output file: part1.r
## [1] "part1.r"
source("part1.r")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Player = col_character(),
##   School = col_character(),
##   Pos = col_character(),
##   labels = col_character(),
##   Status = col_character()
## )
## See spec(...) for full column specifications.
NCAAData <- as.data.frame(NCAAData)

Here is my second data source which includes the ranked teams. I obtained this data source via web scraping.

rankings_web <- read_html("https://www.ncaa.com/rankings/basketball-men/d1/associated-press")

rank <- rankings_web %>%
  html_nodes("tbody") %>%
  html_nodes("tr")

Rank <- rank %>%
  html_nodes("td:first_child") %>%
  html_text() %>%
  as.integer()
Team <- rank %>%
  html_nodes("td:nth_child(2)") %>%
  html_text()
Points <- rank %>%
  html_nodes("td:nth_child(3)") %>%
  html_text()
Record <- rank %>%
  html_nodes("td:nth_child(4)") %>%
  html_text()

Rankings <- cbind.data.frame(Rank=Rank, Team=Team, Points=Points, Record=Record)

#tidying data
Rankings$Team <- gsub(" \\(48\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(3\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(9\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(5\\)", "", Rankings$Team)


colnames(Rankings)[colnames(Rankings)=="Team"] <- "School"
colnames(Rankings)[colnames(Rankings)=="Points"] <- "Team_Points"
colnames(Rankings)[colnames(Rankings)=="Rank"] <- "Team_Rank"
colnames(Rankings)[colnames(Rankings)=="Record"] <- "Team_Record"

Rankings$Team_Points <- gsub(",", "", Rankings$Team_Points)
Rankings$Team_Points <- as.double(Rankings$Team_Points)


NCAAData <- merge(x=NCAAData, y=Rankings, by="School", all.x=TRUE)

Printing out NCAAData Table

(NCAAData)

Printing out Rankings Table

(Rankings)

To get a good idea here is a table of the Ranked teams and their total scoring I used from my second dataset.

ggplot(data=Rankings, aes(x=Points)) +
  geom_jitter(aes(y=School, color=School))+
  labs(title="Top 25 Team Ranked Scoring", x="", y="") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here I tried to use Points and other columns as an indicator to predict if high scoring determines how good the team is but I did not yield good results. However Rebounds gave me a good indicator to predict that good teams depend on rebounding. In order to use points as a good predictor I would need more paramaters or data.

set.seed(385)
top25 <- filter(NCAAData, !is.na(NCAAData$Team_Points))
sample_selection <- top25$Team_Rank %>%
  createDataPartition(p=0.75, list=FALSE)
train <- top25[sample_selection, ]
test <- top25[-sample_selection, ]
train_model <- lm(Team_Rank ~ Minutes_Played + Points + Total_Rebounds + Assists + Steals + Field_Goal_Average, data=top25)
summary(train_model)
## 
## Call:
## lm(formula = Team_Rank ~ Minutes_Played + Points + Total_Rebounds + 
##     Assists + Steals + Field_Goal_Average, data = top25)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.210  -6.784   0.245   6.368  13.199 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        16.3392013  1.3291293  12.293   <2e-16 ***
## Minutes_Played      0.0001129  0.0002674   0.422   0.6730    
## Points              0.0910102  0.0775013   1.174   0.2406    
## Total_Rebounds     -0.1663744  0.0713556  -2.332   0.0199 *  
## Assists            -0.1323454  0.1343128  -0.985   0.3247    
## Steals             -0.2625879  0.3192057  -0.823   0.4109    
## Field_Goal_Average -0.1661954  0.1051725  -1.580   0.1144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.406 on 874 degrees of freedom
## Multiple R-squared:  0.008833,   Adjusted R-squared:  0.002028 
## F-statistic: 1.298 on 6 and 874 DF,  p-value: 0.2553
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.01070385

Here I found that that Win Shares per 40 minutes (the entire game time + if they win) is a good indicator to predict that good teams have decent win shares.

train_model <- lm(Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes, data=top25)
summary(train_model)
## 
## Call:
## lm(formula = Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes, 
##     data = top25)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1135  -6.8410   0.2134   6.3238  13.4419 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              14.41018    0.67263  21.424   <2e-16 ***
## Total_Rebounds           -0.05607    0.06274  -0.894   0.3718    
## Win_Shares_per40_Minutes -7.19916    3.88431  -1.853   0.0642 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.396 on 878 degrees of freedom
## Multiple R-squared:  0.006963,   Adjusted R-squared:  0.004701 
## F-statistic: 3.078 on 2 and 878 DF,  p-value: 0.04654
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.02060821